Keyboard Logs as Natural Annotations for Word Segmentation

Authors

Fumihiko Takahashi

Shinsuke Mori

Abstract

In this paper we propose a framework to improve word segmentation accuracy using input method logs. An input method is software used to type sentences in languages that have far more characters than a keyboard has keys. The main contributions of this paper are: 1) an input method server that proposes word candidates that are not included in the vocabulary, 2) a publicly usable input method that logs user behavior (such as typing and the selection of word candidates), and 3) a method for improving word segmentation by using these logs. We conducted word segmentation experiments on tweets from Twitter and showed that our method improves accuracy in this domain. The method itself is domain-independent and needs only logs from the target domain.
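
To make the idea of logs as annotations concrete, here is a minimal sketch of how one logged input could be turned into partial word-boundary labels. It is an illustration only, not the authors' implementation: the log format (a list of text segments the user confirmed one after another) and the function name `boundaries_from_segments` are assumptions, as is the choice to treat the edges of confirmed segments as boundaries and leave gaps inside a segment unlabeled.

```python
from typing import List, Optional


def boundaries_from_segments(segments: List[str]) -> List[Optional[bool]]:
    """Derive partial boundary labels from confirmed input segments.

    Hypothetical log format: each element of `segments` is one chunk of
    text the user confirmed in the input method, in typing order.

    Returns one label per gap between adjacent characters of the
    concatenated text: True where two confirmed segments meet, None
    (unknown) for gaps inside a segment, since a single confirmed
    segment may still contain more than one word.
    """
    # Ignore empty segments so they do not create spurious boundaries.
    segments = [s for s in segments if s]
    labels: List[Optional[bool]] = []
    for i, seg in enumerate(segments):
        # Gaps inside a confirmed segment stay unlabeled.
        labels.extend([None] * (len(seg) - 1))
        # The gap between this segment and the next is a boundary.
        if i < len(segments) - 1:
            labels.append(True)
    return labels


print(boundaries_from_segments(["今日", "は", "晴れ"]))  # [None, True, True, None]
```

Labels of this kind are partial: a segmenter trained on them would need to tolerate unlabeled positions, and real logs are likely to contain noise that this sketch does not handle.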

Related papers

Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and...

The Discovery of Natural Typing Annotations: User-produced Potential Chinese Word Delimiters

A human-labeled corpus is indispensable for training supervised word segmenters. However, labeling a corpus manually is time-consuming and labor-intensive. When typing Chinese text with Pinyin, people usually need to press "space" or numeric keys to choose words because of homophones, which can be viewed as a cue for segmentation. We argue that such a process can be used to ...

Dzongkha Word Segmentation

Dzongkha, the national language of Bhutan, is continuous in written form and it fails to mark the word boundary. Dzongkha word segmentation is one of the fundamental problems and a prerequisite that needs to be solved before more advanced Dzongkha text processing and other natural language processing tools can be developed. This paper presents our initial attempt at segmenting Dzongkha sentence...

Discovering and understanding word level user intent in Web search queries

Identifying and interpreting user intent are fundamental to semantic search. In this paper, we investigate the association of intent with individual words of a search query. We propose that words in queries can be classified as either content or intent, where content words represent the central topic of the query, while users add intent words to make their requirements more explicit. We argue t...

Training Conditional Random Fields Using Incomplete Annotations

We address corpus-building situations where complete annotation of the whole corpus is time-consuming and unrealistic. Thus, annotation is done only on crucial parts of sentences, or contains unresolved label ambiguities. We propose a parameter estimation method for Conditional Random Fields (CRFs), which enables us to use such incomplete annotations. We show promising results of our method as...

Publication date: 2015
